On Entropy-Compressed Text Indexing in External Memory
Abstract
A recent trend in pattern matching is to design indexing data structures that occupy space very close to that of the indexed text in entropy-compressed form while still achieving good query performance. Two popular indexes, the FM-index [Ferragina and Manzini, 2005] and the CSA [Grossi and Vitter, 2005], achieve this goal by exploiting the Burrows-Wheeler transform (BWT) [Burrows and Wheeler, 1994]. However, because of the intricate permutation structure of the BWT, no locality of reference can be guaranteed when pattern matching is performed with these indexes. Chien et al. [2008] gave an alternative text index based on sparsifying the traditional suffix tree and maintaining an auxiliary 2-D range query structure. Given a text T of length n drawn from an alphabet of size σ, they achieved an O(n log σ)-bit index for T and showed that this index preserves locality in pattern matching and is therefore amenable to use in external-memory settings. We improve upon this index and show how to apply entropy compression to reduce the index space. Our index takes O(n(H_k + 1)) + o(n log σ) bits of space, where H_k is the kth-order empirical entropy of the text. This is achieved by creating variable-length blocks of text using arithmetic coding.
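To make the quantity H_k in the space bound concrete, here is a minimal Python sketch that computes the kth-order empirical entropy of a short string using the standard textbook definition: group each symbol by the k symbols preceding it, then average the zeroth-order entropies of those groups, weighted by group size. The function name `empirical_entropy_k` is a hypothetical choice for illustration and is not taken from the paper. The sketch does not implement the index or its arithmetic-coded blocks.

```python
from collections import Counter, defaultdict
from math import log2


def empirical_entropy_k(text: str, k: int) -> float:
    """Return the kth-order empirical entropy of `text` in bits per symbol.

    Hypothetical helper for illustration; uses the standard definition
    H_k(T) = (1/n) * sum over contexts w of |T_w| * H_0(T_w),
    where T_w collects the symbols that follow context w in T.
    """
    n = len(text)
    if n == 0:
        return 0.0

    # For every length-k context, count the symbols that follow it.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1

    # Sum |T_w| * H_0(T_w) over all contexts w, then normalise by n.
    total_bits = 0.0
    for counts in followers.values():
        m = sum(counts.values())
        total_bits += sum(c * log2(m / c) for c in counts.values())
    return total_bits / n


if __name__ == "__main__":
    sample = "abababababab"
    for order in (0, 1, 2):
        print(order, empirical_entropy_k(sample, order))
```

For a highly repetitive text such as the sample above, H_k drops sharply once k is at least 1, which is why a bound of O(n(H_k + 1)) bits can be far smaller than the O(n log σ) bits of the earlier index.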
Similar resources
An Algorithmic Framework for Compression and Text Indexing
We present a unified algorithmic framework to obtain nearly optimal space bounds for text compression and compressed text indexing, apart from lower-order terms. For a text T of n symbols drawn from an alphabet Σ, our bounds are stated in terms of the hth-order empirical entropy of the text, Hh. In particular, we provide a tight analysis of the Burrows-Wheeler transform (bwt) establishing a bou...
Compression, Indexing, and Retrieval for Massive String Data
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compres...
Entropy-Compressed Indexes for Multidimensional Pattern Matching
In this talk, we will discuss the challenges involved in developing multidimensional generalizations of compressed text indexing structures. These structures depend on some notion of Burrows-Wheeler transform (BWT) for multiple dimensions, though naive generalizations do not enable multidimensional pattern matching. We study the 2D case to possibly highlight combinatorial properties that do n...
Practical Dynamic Entropy-Compressed Bitvectors with Applications
Succinct/compressed data structures aim at providing the same functionality offered by classical data structures while using asymptotically less space. There exist several of these structures for a wide spectrum of applications ranging from strings over arbitrary alphabets to full-text indexing. Their theoretical promises have been met in practice in the static scenario, while the practicality ...
Self-Indexing XML
Self-indexing is a technology that integrates text compression and text indexing, such that a text collection can be simultaneously compressed and indexed. The resulting representation, called a self-index of the text, takes space close to that of the compressed text, is capable of reproducing any text substring, and offers indexed searching of the collection. This has been a major breakthrough in t...